
*: reduce getGCState freq #10817

Open
CalvinNeo wants to merge 4 commits into pingcap:master from CalvinNeo:fix-gc-safepoint

@CalvinNeo CalvinNeo commented Apr 24, 2026

What problem does this PR solve?

Issue Number: close #10818

Problem Summary:

What is changed and how it works?


Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

Summary by CodeRabbit

  • Performance

    • Introduced configurable GC safepoint fetch strategies to reduce external calls and improve predictability, including a cache-only mode that serves cached safepoints without contacting the metadata service.
  • Behavior

    • Safepoint checks now support a cache-only read mode by default for query paths to avoid unexpected remote fetches during execution.
  • Tests

    • Added tests validating safepoint caching behavior and cache-only read paths.

Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>
@ti-chi-bot ti-chi-bot Bot added the do-not-merge/needs-linked-issue, release-note-none, do-not-merge/work-in-progress, and size/L labels Apr 24, 2026

coderabbitai Bot commented Apr 24, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 3f8c94bc-5a81-44ba-a5ca-6a22489eec2c

📥 Commits

Reviewing files that changed from the base of the PR and between 7aa11f4 and 6acf181.

📒 Files selected for processing (7)
  • dbms/src/Debug/dbgFuncSchema.cpp
  • dbms/src/Storages/DeltaMerge/DeltaMergeStore_InternalBg.cpp
  • dbms/src/Storages/KVStore/MultiRaft/PrehandleSnapshot.cpp
  • dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h
  • dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp
  • dbms/src/Storages/StorageDeltaMerge.cpp
  • dbms/src/TiDB/Schema/SchemaSyncService.cpp
💤 Files with no reviewable changes (4)
  • dbms/src/Storages/DeltaMerge/DeltaMergeStore_InternalBg.cpp
  • dbms/src/Debug/dbgFuncSchema.cpp
  • dbms/src/TiDB/Schema/SchemaSyncService.cpp
  • dbms/src/Storages/KVStore/MultiRaft/PrehandleSnapshot.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h

📝 Walkthrough

Walkthrough

Adds a fetch-strategy enum and updates PDClientHelper::getGCSafePointWithRetry to accept GCSafepointFetchStrategy, changes several call sites to use CacheOnly where appropriate, and adds tests verifying cache-only and cache-with-refresh behaviors.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| GC Safepoint API: dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h | Adds enum class GCSafepointFetchStrategy { CacheOnly, UpdateCacheIfNeeded } and changes PDClientHelper::getGCSafePointWithRetry signature to accept fetch_strategy; implements cache-only vs update logic. |
| Tests: dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp | Adds CountingPDClient test double and two PDClientHelperTest cases validating CacheOnly behavior on miss, cached reads, staleness, and cache-refresh when using non-cache fetch. |
| Call-site updates: dbms/src/Storages/StorageDeltaMerge.cpp, dbms/src/Storages/DeltaMerge/DeltaMergeStore_InternalBg.cpp, dbms/src/Storages/KVStore/MultiRaft/PrehandleSnapshot.cpp, dbms/src/TiDB/Schema/SchemaSyncService.cpp, dbms/src/Debug/dbgFuncSchema.cpp | Removes previous boolean ignore_cache arguments; several callers now pass GCSafepointFetchStrategy::CacheOnly or rely on default UpdateCacheIfNeeded. Minor comment additions; no other logic changes. |

Sequence Diagram(s)

sequenceDiagram
    participant Caller as "Caller\n(Storage / Debug / DeltaMerge)"
    participant Helper as "PDClientHelper"
    participant Cache as "Local Cache\n(ks_gc_sp_map)"
    participant PD as "PD Server\n(pd_client)"

    Caller->>Helper: getGCSafePointWithRetry(keyspace, fetch_strategy)
    Helper->>Cache: lookup cached safepoint for keyspace
    alt fetch_strategy == CacheOnly
        Cache-->>Helper: cached value or miss
        Helper-->>Caller: return cached value (or 0 on miss)
    else fetch_strategy == UpdateCacheIfNeeded
        Cache-->>Helper: cached value (may be expired)
        alt cached valid within interval
            Helper-->>Caller: return cached value
        else cached absent/expired
            Helper->>PD: getGCState(...) (with retry/backoff)
            PD-->>Helper: gc_safe_point
            Helper->>Cache: update cache if gc_safe_point != 0
            Helper-->>Caller: return gc_safe_point
        end
    end
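
The flow in the diagram can be sketched as a small, self-contained model. This is only an illustration under stated assumptions, not the code in PDTiKVClient.h: `SafepointCacheSketch`, its `Fetcher` callback (standing in for the PD `getGCState` call with retry), and the `ttl` parameter are hypothetical names; only the enum and its two values come from the walkthrough above.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins for the project's types.
using KeyspaceID = uint32_t;
using Timestamp = uint64_t;

enum class GCSafepointFetchStrategy
{
    CacheOnly,
    UpdateCacheIfNeeded,
};

class SafepointCacheSketch
{
public:
    using Clock = std::chrono::steady_clock;
    using Fetcher = std::function<Timestamp(KeyspaceID)>;

    SafepointCacheSketch(Fetcher fetcher_, std::chrono::seconds ttl_)
        : fetcher(std::move(fetcher_))
        , ttl(ttl_)
    {
        assert(fetcher != nullptr);
    }

    Timestamp get(KeyspaceID keyspace_id, GCSafepointFetchStrategy strategy)
    {
        auto it = cache.find(keyspace_id);
        if (strategy == GCSafepointFetchStrategy::CacheOnly)
        {
            // Never contacts PD: return the cached value even if it has expired, 0 on miss.
            return it == cache.end() ? 0 : it->second.safepoint;
        }
        // UpdateCacheIfNeeded: serve the cache while the entry is still fresh.
        if (it != cache.end() && Clock::now() - it->second.updated_at < ttl)
            return it->second.safepoint;
        // Absent or expired: fetch (stand-in for getGCState) and cache non-zero results.
        const Timestamp fetched = fetcher(keyspace_id);
        if (fetched != 0)
            cache[keyspace_id] = Entry{fetched, Clock::now()};
        return fetched;
    }

private:
    struct Entry
    {
        Timestamp safepoint;
        Clock::time_point updated_at;
    };

    Fetcher fetcher;
    std::chrono::seconds ttl;
    std::unordered_map<KeyspaceID, Entry> cache;
};
```

In this model a `CacheOnly` read before any `UpdateCacheIfNeeded` call yields 0, which is the cache-miss case the review comments below discuss.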

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

severity/minor

Suggested reviewers

  • JinheLin
  • JaySon-Huang
  • kolafish

Poem

🐇 I hop through cache and PD with care,
A safepoint nibble hidden there.
Cache-only waits when servers sleep,
Updates fetch the hill so steep.
Hop—safe rows rest, my watch runs deep.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is largely incomplete: the problem summary is empty, commit message section is empty, test checkboxes are all unchecked, and side effects/documentation sections are all unchecked despite substantive changes. | Fill in the problem summary explaining the getGCState frequency issue, add commit message details about the strategy change, and explicitly indicate which tests were added or why none were needed. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 10.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'reduce getGCState freq' clearly summarizes the primary change, which is reducing the frequency of getGCState calls through a caching strategy modification. |
| Linked Issues check | ✅ Passed | The PR fully addresses the objective stated in linked issue #10818 by introducing a GCSafepointFetchStrategy enum that allows CacheOnly reads, reducing getGCState calls via selective caching behavior across multiple call sites. |
| Out of Scope Changes check | ✅ Passed | All code changes are directly scoped to reducing getGCState frequency: API modification for fetch strategy control, test coverage for cache-only reads, and caller updates aligning with the new caching semantics. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dbms/src/Storages/StorageDeltaMerge.cpp (1)

719-745: ⚠️ Potential issue | 🟠 Major

checkStartTs becomes a no-op on cache miss — this is intentional but needs explicit documentation.

checkStartTs is the safety net that rejects queries whose start_ts is below the GC safepoint. With GCSafepointFetchStrategy::CacheOnly, getGCSafePointWithRetry returns 0 on cache miss (confirmed in PDTiKVClient.h:200-208), making the comparison start_ts < 0 always false — the check is silently skipped.

The design is intentional: background/non-query callers (SchemaSyncService with ignore_cache=true, PrehandleSnapshot, DeltaMergeStore_InternalBg) populate the cache via PD, while query paths consume only the cached value to avoid per-query PD traffic. This is validated by explicit test coverage (CacheOnlyReadPathDoesNotFetchFromPD).

However, a startup window exists: if a query arrives before any background path has populated the cache for a given keyspace (fresh TiFlash process or a new keyspace that none of the background tasks have touched yet), the safety check is bypassed entirely. This trade-off between startup safety and steady-state performance should be:

  1. Explicitly documented in the commit message or PR description — the current log message suggests the default behavior is preserved, which is misleading.
  2. Confirmed acceptable by the authors — either the startup window is guaranteed short in practice (e.g., schema sync always runs before first query), or the risk is acceptable by design.

Consider adding a note in the code or PR explaining why this trade-off is acceptable at TiFlash startup.
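
To make the cache-miss behavior concrete, here is a minimal sketch, assuming timestamps are unsigned 64-bit integers. The names `isBelowSafepoint`, `classifyStartTs`, and `StartTsCheck` are hypothetical and do not appear in the PR; the sketch only shows why a safepoint of 0 disables the comparison and one possible way to make the skipped case explicit.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: timestamps are unsigned 64-bit values.
using Timestamp = uint64_t;

// The plain comparison. With safepoint == 0 no unsigned start_ts can be below it.
constexpr bool isBelowSafepoint(Timestamp start_ts, Timestamp safepoint)
{
    return start_ts < safepoint;
}

enum class StartTsCheck
{
    Passed,
    Rejected,
    SkippedNoCachedSafepoint,
};

// Hypothetical variant that reports the cache-miss case instead of silently passing.
constexpr StartTsCheck classifyStartTs(Timestamp start_ts, Timestamp cached_safepoint)
{
    if (cached_safepoint == 0)
        return StartTsCheck::SkippedNoCachedSafepoint;
    return isBelowSafepoint(start_ts, cached_safepoint) ? StartTsCheck::Rejected : StartTsCheck::Passed;
}

// Compile-time demonstration: a zero safepoint never rejects any start_ts.
static_assert(!isBelowSafepoint(0, 0), "0 is not below 0");
static_assert(!isBelowSafepoint(1, 0), "nothing is below an unsigned 0");
```

A caller could log or count the `SkippedNoCachedSafepoint` outcome, which would make the startup window visible without adding PD traffic to the query path.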

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dbms/src/Storages/StorageDeltaMerge.cpp` around lines 719 - 745, The
checkStartTs function uses PDClientHelper::getGCSafePointWithRetry(...,
GCSafepointFetchStrategy::CacheOnly) (called from checkStartTs) which returns 0
on cache miss, effectively making the start_ts < safe_point check a no-op during
cold starts; add an explicit in-code comment above checkStartTs (or adjacent to
the PDClientHelper call) stating that CacheOnly yields 0 on miss, that
background callers (SchemaSyncService, PrehandleSnapshot,
DeltaMergeStore_InternalBg) are responsible for populating the cache, and
document the startup window trade-off and why it is acceptable (or link to the
PR/issue) so reviewers/readers are not misled by the current behavior/logging.
🧹 Nitpick comments (3)
dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp (1)

1268-1268: Wall-clock sleep adds mild flakiness.

sleep_for(2s) combined with safe_point_update_interval_seconds=1 is fine in practice, but makes this test timing-dependent and slow. If you ever want to make it deterministic, the cache uses steady_clock inside getGCSafepointIfValid, so a custom clock injection or exposing an "expire now" hook would eliminate the sleep. Not required for this PR.
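
The clock-injection idea can be sketched as follows. This is an assumption-laden illustration rather than the project's cache: `ExpiringSafepoint` and its `NowFn` parameter are hypothetical, and the only detail taken from the comment above is that expiry is measured with `steady_clock`.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>

// Hypothetical single-entry cache whose notion of "now" can be replaced in tests.
class ExpiringSafepoint
{
public:
    using TimePoint = std::chrono::steady_clock::time_point;
    using NowFn = std::function<TimePoint()>;

    // By default the real steady_clock is used; tests pass a fake clock instead.
    explicit ExpiringSafepoint(NowFn now_ = [] { return std::chrono::steady_clock::now(); })
        : now(std::move(now_))
    {
        assert(now);
    }

    void set(uint64_t safepoint)
    {
        value = safepoint;
        updated_at = now();
    }

    // Returns the value only while it is younger than max_age.
    std::optional<uint64_t> getIfValid(std::chrono::seconds max_age) const
    {
        if (!value.has_value() || now() - updated_at >= max_age)
            return std::nullopt;
        return value;
    }

private:
    NowFn now;
    std::optional<uint64_t> value;
    TimePoint updated_at{};
};
```

A test can then advance a captured `time_point` by two seconds instead of sleeping, so the expiry path runs immediately and deterministically.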

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp` at line 1268, The test
uses a wall-clock sleep (std::this_thread::sleep_for(std::chrono::seconds(2)))
which makes it timing-dependent and slow; replace this with a deterministic
approach by injecting a test clock or adding an "expire now" hook so the
cache/GC safepoint logic can be advanced without sleeping — target the code
paths around getGCSafepointIfValid and the safe_point_update_interval_seconds
behavior (use a steady_clock-injectable implementation or call an exposed method
to force expiry) so the test can trigger the same cache refresh immediately and
remove the sleep.
dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h (1)

198-225: Consider whether the CacheOnly path should emit tiflash_gc_safepoint_backoff_count{type=success}.

In the CacheOnly branch no PD request is ever made, so calling observe_backoff_count(true) (with backoff_count == 0) inflates the type_success histogram with samples that do not correspond to an actual PD fetch. Previously every observation in that metric reflected a real PD interaction; after this change the vast majority of observations will come from cache-only lookups on the hot read path and the metric will mostly report zeros. If the intent of the metric is to track PD-fetch backoff, consider skipping the observation on the CacheOnly cache-hit/miss path (and also skipping it on the existing fast-path cache hit).
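
A rough sketch of the suggested behavior, recording a backoff sample only when PD is actually contacted. Everything here is hypothetical scaffolding (`BackoffHistogramSketch`, `getSafepointObserved`, the `fetch_from_pd` callback); it does not reproduce the project's metrics API or `observe_backoff_count`.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical histogram stand-in: keeps one sample per observation.
struct BackoffHistogramSketch
{
    std::vector<int> samples;
    void observe(int backoff_count) { samples.push_back(backoff_count); }
};

// Returns the safepoint and records a backoff sample only when PD was contacted.
inline uint64_t getSafepointObserved(
    std::optional<uint64_t> cached,
    bool cache_only,
    const std::function<uint64_t(int & backoff_count)> & fetch_from_pd,
    BackoffHistogramSketch & success_histogram)
{
    if (cache_only)
        return cached.value_or(0); // no PD request, so no sample
    if (cached.has_value())
        return *cached; // fast-path cache hit, also no sample
    int backoff_count = 0;
    const uint64_t fetched = fetch_from_pd(backoff_count);
    success_histogram.observe(backoff_count); // one sample per real PD fetch
    return fetched;
}
```

With this shape every sample in the histogram corresponds to one PD round trip, so hot-path cache reads no longer fill it with zeros.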

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h` around lines 198 - 225,
The CacheOnly branch and the fast-path cached-hit branch currently call
observe_backoff_count(true) even though no PD request occurs; remove those calls
so the tiflash_gc_safepoint_backoff_count{type=success} metric is only observed
when an actual PD fetch/backoff happens. Specifically, in PDTiKVClient.h inside
the code paths handling GCSafepointFetchStrategy::CacheOnly and the
getGCSafepointIfValid(cache-hit) branch, eliminate the
observe_backoff_count(true) calls and ensure observe_backoff_count is only
invoked in the code path(s) that perform a real PD fetch (the fallback PD-fetch
logic). Use the existing symbols ks_gc_sp_map.getGCSafepoint,
ks_gc_sp_map.getGCSafepointIfValid, and observe_backoff_count to locate and
adjust the calls.
dbms/src/Storages/StorageDeltaMerge.cpp (1)

915-915: checkStartTs is now invoked three times per read; with CacheOnly this is effectively identical to a single call.

Previously each checkStartTs call could trigger a PD fetch when the cache was stale, so invoking it pre-read / post-read / post-snapshot gave each call an independent chance to pick up a freshly advanced safepoint. With CacheOnly all three calls read the same cached value (unless a background path races in between), so the "ensure after read" invariant the comments describe no longer adds meaningful coverage. Worth a short note in the PR or code comments that the post-read checks are kept as a defense-in-depth against a future background refresh between the two calls, rather than active safety.

Also applies to: 1011-1011, 1057-1057

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dbms/src/Storages/StorageDeltaMerge.cpp` at line 915, Multiple calls to
checkStartTs(mvcc_query_info.start_ts, context, query_info.req_id, keyspace_id)
now read the same cached safepoint under CacheOnly, so the post-read invocations
no longer increase coverage; update the code comment near the checkStartTs calls
(the pre-read/post-read/post-snapshot invocations) to state explicitly that with
CacheOnly the checks observe the same cached value and that the extra calls are
retained as defense-in-depth only (to catch a rare background refresh/race),
rather than for active additional safety, so future readers understand why we
keep the redundant calls.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 08e9eb71-838f-4216-a4d9-5768a598805d

📥 Commits

Reviewing files that changed from the base of the PR and between 0dc254b and c3f1d3c.

📒 Files selected for processing (3)
  • dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h
  • dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp
  • dbms/src/Storages/StorageDeltaMerge.cpp

@CalvinNeo CalvinNeo changed the title from "WIP reduce getGCState freq" to "*: reduce getGCState freq" Apr 28, 2026
@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/work-in-progress label Apr 28, 2026
Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp (1)

1279-1279: Optional: reduce fixed wait time in expiry test.

2s works, but for a 1s validity window, a slightly-above-1s wait (e.g. 1100ms) keeps semantics and trims test runtime.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp` at line 1279, Replace
the long fixed 2s sleep in the expiry test's std::this_thread::sleep_for call
with a slightly-above-1s duration (e.g., 1100ms) so the test still exceeds the
1s validity window but runs faster; locate the std::this_thread::sleep_for(...)
invocation in the expiry test and change the duration to
std::chrono::milliseconds(1100).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp`:
- Around line 1223-1261: The test leaves
PDClientHelper::removeKeyspaceGCSafepoint(keyspace_id) as a trailing cleanup
that won't run if an ASSERT_* aborts the test; make cleanup RAII-safe by
creating a small scope guard (e.g., KeyspaceGCSafepointGuard or a
std::unique_ptr with a custom deleter) that calls
PDClientHelper::removeKeyspaceGCSafepoint(keyspace_id) in its destructor and
instantiate it immediately after defining keyspace_id; apply the same RAII guard
to the other duplicate cleanup site around lines covered by the second block
(the case around 1265-1304) so the safepoint is always removed even on assertion
failures.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 3206a305-f1a7-4684-ad8c-8ca7cce269ca

📥 Commits

Reviewing files that changed from the base of the PR and between c3f1d3c and a7b59a0.

📒 Files selected for processing (1)
  • dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp

Comment thread dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp
d
Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp (1)

1226-1263: ⚠️ Potential issue | 🟡 Minor

Make cache cleanup assertion-safe with RAII.

If any ASSERT_* aborts early, the trailing cleanup won’t run and cached safepoint state can leak into later tests.

Proposed fix
 TEST(PDClientHelperTest, CacheOnlyReadPathDoesNotFetchFromPD)
 {
     constexpr KeyspaceID keyspace_id = 9527;
     PDClientHelper::removeKeyspaceGCSafepoint(keyspace_id);
+    SCOPE_EXIT({ PDClientHelper::removeKeyspaceGCSafepoint(keyspace_id); });

@@
-    PDClientHelper::removeKeyspaceGCSafepoint(keyspace_id);
 }

 TEST(PDClientHelperTest, CacheOnlyReadPathCanReturnExpiredCache)
 {
     constexpr KeyspaceID keyspace_id = 9528;
     PDClientHelper::removeKeyspaceGCSafepoint(keyspace_id);
+    SCOPE_EXIT({ PDClientHelper::removeKeyspaceGCSafepoint(keyspace_id); });

@@
-    PDClientHelper::removeKeyspaceGCSafepoint(keyspace_id);
 }
Based on learnings: Ensure `TearDown()` is called or cleanup helpers are used to avoid side effects on other tests.

Also applies to: 1268-1305
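
For readers unfamiliar with the pattern, the following sketch shows why a scope guard runs cleanup even when a test body exits early. It is a generic stand-in, not the project's `SCOPE_EXIT` macro: `ScopeGuardSketch` and `testBodyThatReturnsEarly` are hypothetical names, and the early `return` models a failed gtest `ASSERT_*`, which returns from the current function.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Minimal scope guard: runs the callback when the guard goes out of scope.
class ScopeGuardSketch
{
public:
    explicit ScopeGuardSketch(std::function<void()> on_exit_)
        : on_exit(std::move(on_exit_))
    {}

    ~ScopeGuardSketch()
    {
        if (on_exit)
            on_exit();
    }

    ScopeGuardSketch(const ScopeGuardSketch &) = delete;
    ScopeGuardSketch & operator=(const ScopeGuardSketch &) = delete;

private:
    std::function<void()> on_exit;
};

// Simulates a test body that may bail out early, as a failed ASSERT_* would.
inline void testBodyThatReturnsEarly(int & cleanup_runs, bool fail_early)
{
    // Stand-in for the removeKeyspaceGCSafepoint(keyspace_id) cleanup.
    ScopeGuardSketch guard([&] { ++cleanup_runs; });
    if (fail_early)
        return; // the guard's destructor still runs here
    // ... remaining assertions would go here ...
}
```

A trailing cleanup call placed at the end of the function would be skipped on the early-return path, which is the leak the comment above describes.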

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp` around lines 1226 -
1263, Wrap the PDClientHelper::removeKeyspaceGCSafepoint(keyspace_id) cleanup in
an RAII/TearDown-safe mechanism so it always runs even if ASSERT_* aborts;
specifically, create a scoped cleanup (or move the test into a test fixture and
implement TearDown) that calls
PDClientHelper::removeKeyspaceGCSafepoint(keyspace_id) and use it around the
calls to PDClientHelper::getGCSafePointWithRetry and related assertions
(references: PDClientHelper::removeKeyspaceGCSafepoint,
PDClientHelper::getGCSafePointWithRetry, and the failing test in
gtest_new_kvstore.cpp); apply the same RAII/fixture cleanup to the other similar
block noted (lines ~1268-1305).
🧹 Nitpick comments (1)
dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp (1)

1281-1281: Prefer controlled staleness over wall-clock sleep.

std::this_thread::sleep_for here adds avoidable test latency and can introduce CI timing flakiness; consider a failpoint/sync-point based trigger for cache-expiry simulation.

Based on learnings: storage engine tests should rely on failpoints (and SyncPointCtl) to simulate timing/race conditions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp` at line 1281, The test
currently uses std::this_thread::sleep_for to simulate cache expiry, which adds
latency and flakiness; replace that wall-clock sleep by driving a deterministic
test synchronisation: remove std::this_thread::sleep_for and instead trigger the
cache-expiry path via a failpoint or SyncPointCtl (e.g., enable/trigger a named
failpoint or SyncPoint and wait for its notification) so the test explicitly
advances the component under test (cache invalidation/expiry) and blocks on the
sync-point rather than sleeping; look for occurrences of
std::this_thread::sleep_for in gtest_new_kvstore.cpp and wire the test to use
the existing FailPoint/SyncPointCtl APIs to simulate the timing event.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 29f9d944-908f-4701-9e71-0b671477d651

📥 Commits

Reviewing files that changed from the base of the PR and between a7b59a0 and 7aa11f4.

📒 Files selected for processing (1)
  • dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp

throw pingcap::Exception("not implemented", pingcap::ErrorCodes::UnknownError);
}

bool isMock() override { return false; }
Contributor

Why return false here?

const auto min_interval = std::max(static_cast<Int64>(1), safe_point_update_interval_seconds);
auto ks_gc_info = ks_gc_sp_map.getGCSafepointIfValid(keyspace_id, min_interval);
if (ks_gc_info.has_value())
if (fetch_strategy == GCSafepointFetchStrategy::CacheOnly)
Contributor

It seems that if the arguments are ignore_cache = false and fetch_strategy = CacheOnly, CacheOnly will be ignored.

Contributor

Maybe we should remove the ignore_cache parameter entirely?

Member Author

@JaySon-Huang I think I can change the two remaining occurrences to false, then rerun the tests to see whether anything fails.

namespace
{

class CountingPDClient : public pingcap::pd::IClient
Contributor

Can we use failpoint instead of constructing a whole CountingPDClient just for testing?

Member Author

I had AI generate a failpoint-based version, and I don't think it is very good, because we would need to add hooks including:

  • force_pd_get_gc_state_safe_point
  • force_pd_get_gc_state_resp_error
  • force_pd_get_gc_state_throw
  • force_pd_get_gc_state_count
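
For comparison, a counting test double keeps the same controls as plain fields on one object. The sketch below is a simplified assumption, not the PR's `CountingPDClient`: `IGCStateSource` is a hypothetical, narrowed interface (the real `pingcap::pd::IClient` has more methods), and it covers only the safe-point, throw, and count controls from the list above.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical, narrowed interface used only for this illustration.
struct IGCStateSource
{
    virtual ~IGCStateSource() = default;
    virtual uint64_t getGCSafePoint(uint32_t keyspace_id) = 0;
};

// Test double: counts calls and lets the test choose the reply or inject a failure.
class CountingGCStateSource : public IGCStateSource
{
public:
    uint64_t getGCSafePoint(uint32_t /*keyspace_id*/) override
    {
        ++call_count; // counted even when the call is made to fail
        if (should_throw)
            throw std::runtime_error("injected PD failure");
        return safepoint;
    }

    uint64_t safepoint = 0;    // value returned to the caller
    bool should_throw = false; // simulate a failing request
    int call_count = 0;        // number of requests seen so far
};
```

A test can then assert on `call_count` directly to check that a cache-only read issued no request.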

Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>
@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 29, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JaySon-Huang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the needs-1-more-lgtm label Apr 29, 2026
@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 29, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-29 08:43:31.148845327 +0000 UTC m=+2760216.354205384: ☑️ agreed by JaySon-Huang.

@ti-chi-bot ti-chi-bot Bot added the approved label Apr 29, 2026
@CalvinNeo CalvinNeo requested a review from windtalker April 29, 2026 08:47

Labels

approved needs-1-more-lgtm Indicates a PR needs 1 more LGTM. release-note-none Denotes a PR that doesn't merit a release note. severity/major size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Too frequent getGCState

3 participants